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Ph Abstract 



In this paper, we establish a robustification of an on-line algorithm for modelling asset prices within a 
hidden Markov model (HMM). In this HMM framework, parameters of the model are guided by a Markov 
chain in discrete time, parameters of the asset returns are therefore able to switch between different regimes. 
The parameters are estimated through an on-line algorithm, which utilizes incoming information from the 
market and leads to adaptive optimal estimates. We robustify this algorithm step by step against additive 
outliers appearing in the observed asset prices with the rationale to better handle possible peaks or missings 
in asset returns. 

d Keywords: Robustness, HMM, Additive outlier, Asset pricing 
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1 Introduction 

JZ. Realistic modelling of financial time series from various markets (stocks, commodities, interest rates etc.) in 

recent years often is achieved through hidden Markov or regime-switching models. One major advantage of 
regime-switching models is their flexibility to capture switching market conditions or switching behavioural 

CN aspects of market participants resulting in a switch in the volatility or mean value. 

Regime-switching models were first applied to issues in financial markets through Hamilton [1989], where 
he established a Markov switching AR-model to model the GNP of the U.S. His results show promising effects 
of including possible regime-switches into the characterisation of a financial time series. A lot of further ap- 
^ proaches to use regime-switching models for financial time series followed, e.g. switching ARCH or switching 

GARCH models (see for example Cai [1994] and Gray [1996]), amongst many other applications. 



H 

Various algorithms and methods for statistical inference arc applied within these model set-ups, including 

as famous ones as the Baum- Welch algorithm and Viterbi's algorithm for an estimation of the optimal state 
sequence. HMMs in Finance, both in continuous and in discrete time often utilise a filtering technique which 
was developed by Elliott [1994]. Adaptive filters are derived for processes of the Markov chain (jump process, 
occupation time process and auxiliary processes) which are in turn used for recursive optimal parameter esti- 
mates of the model parameters. This filter-based Expectation-Maximization (EM) algorithm leads to an on-line 
estimation of model parameters. Our model set-up is based on Elliott's filtering framework. 

This HMM can be applied to questions, which arise in asset allocation problems. An investor typically has to 
decide, how much of his wealth shall be invested into which asset or asset class and when to optimally restructure 
a portfolio. Asset allocation problems were examined in a regime-switching setting by Ang and Bekaert [2002] , 
where high volatility and high correlation regimes of asset returns were discovered. Guidolin and Timmcrmann 
[2007] presented an asset allocation problem within a regime-switching model and found four different possible 
underlying market regimes. A paper by Sass and Haussmann [2004] derives optimal trading strategies and 
filtering techniques in a continuous-time regime-switching model set up. Optimal portfolio choices were also 
discussed in Elliott and van der Hoek [1997] and Elliott and Hinz [2003] amongst others. Here, Markowitz's 



famous mean- variance approach (see Markowitz [1952]) is transferred into an HMM and optimal weights are 
derived. A similar Markowitz based approach within an HMM was developed in Erlwein et al. [2011], where 
optimal trading strategies for portfolio decisions with two asset classes are derived. Trading strategies are 
developed herein to find optimal portfolio decision for an investment in either growth or value stocks. Elliott's 
filtering technique is utilised to predict asset returns. 

However, most of the optimal parameter estimation techniques for HMMs in the literature only lead to rea- 
sonable results, when the market data set does not contain significant outliers. The handling of outliers is an 
important issue in many financial models, since market data might be unreliable at times or high peaks in 
asset returns, which might occur in the market from time to time shall be considered separately and shall not 
negatively influence the parameter estimation method. In general, higher returns in financial time series might 
belong to a separate regime within an HMM. This flexibility is already included in the model set-up. However, 
single outliers, which are not typical for any of the regimes considered, shall be handled with care, a separate 
regime would not reflect the abnormal data point. In this paper, we will develop a robustification of Elliot's 
filter-based EM-algorithm. In section 2 we will set the HMM framework, which is applied (either in a one- 
or multi-dimensional setting) to model asset or index returns. The general filtering technique is described in 
section 3. The asset allocation problem which clarifies the effect outliers can have on the stability of the filters 
is developed in section 4. Section 5 then states the derivation of a robustification for various steps in the filter 
equations. The robustification of a reference probability measure is derived as well as a robust version of the 
filter-based EM-algorithm. An application of the robust filters is shown in section 6 and section 7 finishes our 
work with some conclusions and possible future applications. 



2 Hidden Markov model framework for asset returns 

For our problem setting we first review a filtering approach for a hidden Markov model in discrete time which 
was developed by Elliott [1994]. The logarithmic returns of a stock or an index follow the dynamics of the 
observation process y^ , which can be interpreted as a discretized version of the Geometric Brownian motion, 
which is a standard process to model stock returns. The underlying hidden Markov chain x^ cannot be directly 
observed. The parameters of the observation process are governed by the Markov chain and are therefore able 
to switch between regimes over time. 

We work under a probability space (U,,T 1 P) under which Xj, is a homogeneous Markov chain with finite 
state space / = {1,...,N} in discrete time (k — 0,1,2...). Let the state space of X& be associated with 
the canonical basis {ei,e2, ...,ejv} <5 R N with ej = (0, ..., 0, 1, 0, ...,0) T <G M. N . The initial distribution of 
xo is known and II = (iTji) is the transition probability matrix with nji = P(xfe + i = e^x^ = e^). Let 
J 7 * = ct{xo, ...,Xfc} be the a -field generated by xo, ...,Xfc and let J 7 * be the complete filtration generated by 
J 7 * . Under the real world probability measure P, the Markov chain x has the dynamics 

Xfe+i = IIx fe +v k+1 (2.1) 

where v^ + i := x^+i — Ilxfe is a martingale increment (see Theorem in Elliott [1994]). 

The Markov chain x,t is "hidden" in the log returns yu+i of the stock price Sk- Our observation process 
is given by 

y k+ i = In -— - = /(x fe ) + cr(x fe )wfe+i (2.2) 

where x^ has finite state space and w k 's constitute a sequence of i.i.d. random variables independent of x. 
The real-valued process y can be re-written as 

Uk+i = (/,Xfe) + (<r,x fc ) Wk+i- (2.3) 

Note that f = (/i,/2, ■••, /jv) and 0" = (oi, c 2 , ...,<7tv) t are vectors, furthermore /(x/,) = (f, x^) and 
cr(x^) = (<r, Xfc), where (b, c) denotes the Euclidean scalar product in R^ of the vectors b and c. We 
assume a t ^ 0. Let T\ be the filtration generated by the cr(yi, j/2, ■■■,Vk) and Tk = F% v ^k IS * ne global 



filtration. 

The following theorem (Elliott [1994]) states that the dynamics of the underlying Markov chain can be de- 
scribed by martingale differences. 

3 Essential Steps in Elliott's Algorithm 

3.1 Change of Measure 

A widely used concept in filtering applications, going back to Zakai [1969] for stochastic filtering, is a change 
of probability measure technique. A measure change to a reference measure P is applied here, under which 
filters for the Markov chain and related processes are derived. Under P, the underlying Markov chain still has 
the dynamics x^ +1 = Ilxfe + ~Vk+i but is independent of the observation process and the observations y k are 
A/"(0, 1) i.i.d. random variables. 

Following the change of measure technique which was outlined in Elliott et al. [1995] the adaptive filters for the 
Markov chain and related processes are derived under this "idealised" measure P. Changing back to the real 



world is done by constructing P from P through the Radon-Nikodym derivative 
Afc we define the process A; 

h '■= 



dp 
dp 



= Afc. To construct 



(3.1) 



<j{-x.i-i)4>{yi) 

where 4>(z) is the probability density function of a standard normal random variable Z and set Afe := 
n /=1 A/, k > 1, Aq = 1 . Under P the sequence of var 
dard normals, where we have w k+ i = (T ( x fe) X (j/fe+i — /( x fc)) ■ 



ri/=i ^h k > 1, Aq = 1 . Under P the sequence of variables u>i,W2, ■ ■ ■ , is a sequence of i.i.d. stan- 



3.2 Filtering for general adapted processes 

The general filtering techniques and the filter equations which were established by Elliott [1994] for Markov 
chains observed in Gaussian noise are stated in this subsection. This filter-based EM-algorithm is adaptive, 
which enables fast calculations and filter updates. Our robustification partly keeps this adaptive structure of 
the algorithm, although the recursivity cannot be kept completely. 

In general, filters for four types of processes related to the Markov chain, namely the state space process, the 
jump process, the occupation time process and auxiliary processes including terms of the observation process 
are derived. Information on these processes can be filtered out from our observation process and can in turn 
be used to find optimal parameter estimates. 

To determine the expectation of any J 7 — adapted stochastic process H given the filtration .Fjff, consider the 
reference probability measure P defined as P(A) = J A AdP . From Baycs' theorem a filter for any adapted 

process H is given by E [H k \ F y k ] = E[H k A k | J*] / E[k k | Jf] . We define Vk (H k ) := E[H k A k \ J%], so 

that E[H k | J 7 !] = rj k (H k ) / rj k (l). A recursive relationship between r/ k (H k ) and r)k-i(Hk-i) nas to be found, 
where r)o(H ) = E[H ]. However, a recursive formula for the term ri k ^i(H k _ix k _i) is found. To relate r/ k (H k ) 
and r/ k (H k x k ) we note that with (l,Xfe) = 1 



(l,»fc(#ifcx fc )) = %(Ffe(l,x fe )) = r] k (H k ). (3.2) 



Therefore 



fTw- i r-y] - (MfcCgfc^fc)) coon 

^'^ d,*(x fe )) • (3 ' 3) 

A general recursive filter for adapted processes was derived by Elliott [1994]. Suppose Hi is a scalar 

T = ct((x(, Y t ) t )— adapted process, Hq is T* measurable and Hi = ff;_i + a; + (bi, V;) + gi,f(yi), where a, b 



and g are J-" -predictable, / is a scalar-valued function and V; = x; — Ilx(_i. A recursive relation for rj k (H k x k ) 
is given by 

N 

i] k (H k x k ) = y^ V{y k ) [(ei,T?fc-i(gfe_iXfe_i))IIei 
i=i 
+(ej,%_i(afeX fe _i))IIej 

+(diag(ilei) - (nej)(Hei)')%-i(M e J> x fc-i)) 

+rik-i(gk(ei,x k -i))f(y k )Ile i ] (3.4) 

Here, for any column vectors z and y, zy' denotes the rank-one (if z 7^ and y ^0) matrix zy T . The term 
V(y k ) denotes the component-wise Radon-Nikodym derivative A^ : 

r l (yk) = <f>{ -)/vi<f>(yk) 

Now, filters for the state of the Markov chain as well as for three related processes: the jump process, 
the occupation time process and auxiliary processes of the Markov chain are derived. These processes can be 
characterised as special cases of the general process Hi . 

The estimator for the state x^ is derived from r) k (H k x k ) by setting H k = Hq = 1, a k = 0, b k = and 
g k = 0. This implies that 

N 

i] k (x k ) = ^r 4 (2/ fc )(e i ,77 fe _i(x fe _i))riej . (3.5) 

The first related process is the number of jumps of the Markov chain x. k from state e r to state e s in time 
k, J k = X)f =1 (xj_i,e r )(xj,e 8 ). Setting H k = j[ ] ,H = 0, a k = (x fe _i, e r )n sr , b k = (x fe _ 1 ,e r )e 8 and 
g k = in equation (3.4) we get 

N 

77fc(Jfx fc ) = ^P(y fe )(r? fe _i(J|I 1 x fe _ 1 ),e i )ne i 

i=l 

+r r (y fe )?7 fe _i((x fe -i,e r .))7r sr e s . (3.6) 

(r) 

The second process O k denotes the occupation time of the Markov process x , which is the length of time x 
spent in state r up to time k. Here, O k — J2i=i( x i-ii e r) = O r k _ 1 + (xfe_i,e r ) . We set H k = O r kl H = 0, 
a k = (xfe_i,e r ) , b k — and g k = in equation (3.4) to obtain 

JV 

Vk(O r k x k ) = ^r J (y fe )(77 fe _i(0^_ 1 x fc _i),e i )ne i 

i=l 

+r r (y fc )(%-i(x fe _ 1 ),e r )ne r . (3.7) 

Finally, consider the auxiliary process T k (<?) , which occur in the maximum likelihood estimation of model 
parameters. Specifically, T k (g) — X^=i( x '-ii e r)9(2/z)j where g is a function of the form g(y) = y or 
g(y) — V 2 ■ We apply formula (3.4) and get 

N 

Vk(T k {g)x k ) = ^2r t (y k )(i lk - 1 (T k ~_ 1 (g(y k - 1 ))x k - 1 ),e l )ne l 

+r r (y/ £ )(7? fc -i(xfe-i),e r ).g(y fc )ne r . (3.8) 

The recursive optimal estimates of J, O and T can be calculated using equation (3.2). 



3.3 Filter-based EM-algorithm 

The derived adapted filters for processes of the Markov chain can now be utilised to derive optimal parameter 
estimates through a filter-based EM-algorithm. The set of parameters p , which determines the regime-switching 
model is 

P = fai, l<i,j<N, fi, ai,l<i< N}. (3.9) 

Initial values for the EM algorithm are assumed to be given. Starting from these values updated parameter 
estimates are derived which maximise the conditional expectation of the log-likelihoods. The M-step of the 
algorithm deals with maximizing the following likelihoods: 
M-Step 



The likelihood in the global J 7 -model is given by 



i 



\ogA t (aJ;(x s ,y s ) s < t ) = -^E (log(a,x s _i) 

8=1 



fc-(/,x s -i)) ; 



• In the T y -model, where the Markov chain is not observed, we obtain 



Lt{e,f\{y s )s<t =E[logA t (o-,/;(x a ,j/ s ) g < t ) | F t ] = 

1 N 
-o E {^^Ot + CW) " 2f t \y)f k + 6 k t f)/a£) (3.10) 



fc=i 



The maximum likelihood estimates of the model parameters can be expressed through the adapted filters. 
Whenever new information is available on the market, the filters are updated and, respectively, updated pa- 
rameter estimates can be obtained. 



Theorem 3.1 (Optimal parameter estimates) Write Hk = E\Hk\J-^.\ for any adapted process H. With 
J , O and T denoting the best estimates for the processes J , O and T , respectively, the optimal parameter 
estimates %ji,fi and <ji are given by 



4 l vk(JV) 



01 Vk(Ol) 

Jf = v(T (l Ky)h 
Qf v(OW) k 



(3.11) 
(3.12) 



T£\v*)-2fiTJ;\v) + ftd<z\ (313) 



\ o 



(i) 

k 



Proof T:he derivation of the optimal parameter estimates can be found in Elliott et al. [1995]. //// 

Summary The filter-based EM-algorithm runs in batches of n data points ( n typically equals to a minimum 
of ten up to a maximum of fifty) over the given time series. The parameters are updated at the end of each 
batch. Elliott's Algorithm comprises the following steps 

(0) Find suitable starting values for II and / , a. 

(RN) Determine the RN-derivativc for the measure change to P. 

(E) Recursively, compute filters J/ 1 , 0\ , and T^(g). 

(Ml) Obtain ML-estimators / — (/i, . . . , /jy) and a = (a\, . . . , &n)- 

(M2) Obtain ML-estimators II. 

(Rec) Go to (RN) to compute the next batch. 
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4 Outliers in Asset Allocation Problem 

4.1 Outliers in General 

In the following sections we derive a robustification of the Algorithm 3.3 to stabilize it in the presence of outliers 
in the observation process. To this end let us discuss what makes an observation an outlier. First of all, out- 
liers are exceptional events, occurring rarely, say with probability 5% -10%. Rather than captured by usual 
randomness, i.e., by some distributional model, they belong to what Knight [1921] refers to uncertainty: They 
are uncontrollable, of unknown distribution, unpredictable, their distribution may change from observation 
to observation, so they are non-recurrent and do not form an additional state, so cannot be used to enhance 
predictive power, and, what makes their treatment difficult, they often cannot be told with certainty from ideal 
observations. 

Still, the majority of the observations in a realistic sample should resemble an ideal (distributional) setting 
closely, otherwise the modeling would be questionable. Here we understand closeness as in a distributional 
sense, as captured, e.g., by goodness-of-fit distances like Kolmogorov, total variation or Hellinger distance. 
More precisely, ideally, this closeness should be compatible to the usual convergence mode of the Central Limit 
Theorem, i.e., with weak topology. In particular, closeness in moments is incompatible with this idea. 

Topologically speaking, one would most naturally use balls around a certain element, i.e., the set of all distri- 
butions with a suitable distance no larger than some given radius e > to the distribution assumed in the 
ideal model. 

Conceptually, the most tractable neighborhoods are given by the so-called Gross Error Model, defining a 
neighborhood U about a distribution F as the set of all distributions given by 

U c {F,s) = {G\3H: G={l-e)F + sH} (4.1) 

They can also be thought of as the set of all distributions of (realstic) random variables X T ° constructed as 

X TC = (1 - U)X id + UX di (4.2) 

where X id is a random variable distributed according to the ideal distribution and U is an independent Bin(l, e) 
switching variable, which in most cases lets you see X ld but in some cases replaces it by some contaminating 
or distorting variable X dl which has nothing to do with the original situation. 

4.2 Time-dependent Context: Exogenous and Endogenous Outliers 

In our time dependent setup in addition to the i.i.d. situation, we have to distinguish whether the impact of 
an outlier is propagated to subsequent observations or not. Historically there is a common terminology due to 
Fox [1972], who distinguishes innovation outliers (or IO's) and additive outliers (or AO's). Non-propagating 
AO's arc added at random to single observations, while IO's denote gross errors affecting the innovations. For 
consistency with literature, we use the same terms, but use them in a wider sense: IO's stand for general en- 
dogenous outliers entering the state layer (or the Markov chain in the present context), hence with propagated 
distortion. As in our Markov chain, the state space is finite, IO's are much less threatening as they are in general. 

Correspondingly, wide-sense AO's denote general exogenous outliers which do not propagate, hence also com- 
prise substitutive outliers or SO's as defined in a simple generalization of (4.2) to the state space context in 
equations (4.3)-(4.6). 

y ro = (1 - U)Y' d + UY d \ U ~ Bin(l, r) (4.3) 

for U independent of (X, Y ld ,Y dl ) and some arbitrary distorting random variable Y dl for which we assume 

Y di , X independent (4.4) 

and the law of which is arbitrary, unknown and uncontrollable. As a first step consider the set dU so (r) defined 
as 

dU so (r) = J£(X, r rc ) | r ro ace. to (4.3) and (4.4)} (4.5) 
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Figure 1: Optimal parameter estimates for monthly MSCI returns between 1994 and 2009 



Because of condition (4.4), in the sequel we refer to the random variables Y rc and Y di instead of their respective 
(marginal) distributions only, while in the common gross error model, reference to the respective distributions 
would suffice. Condition (4.4) also entails that in general, contrary to the gross error model, C(X, Y id ) is not 
element of dU so (r) , i.e., not representable itself as some C(X 1 Y 1 ") in this neighborhood. 



As corresponding (convex) neighborhood we define 

U so {r)= (J dU so {s) 



(4.6) 



hence the symbol "<9" in dU so , as the latter can be interpreted as the corresponding surface of this ball. Of 
course, U so (r) contains C(X, Y' d ) . In the sequel where clear from the context we drop the superscript SO 
and the argument r . 



Due to their different nature, as a rule, IO's and AO's require different policies: As AO's are exogenous, 
we would like to damp their effect, while when there are IO's, something has happened in the system, so the 
usual goal will be to track these changes as fast as possible. 



4.3 Evidence for Robustness Issue in Asset Allocation 

In this section we examine the robustness of the filter and parameter estimation technique. The filter technique 
is implemented and applied to monthly returns of MSCI index between 1994 and 2009. The MSCI World Index 
is one of the leading indices on the stock markets and a common benchmark for global stocks. The algorithm 
is implemented with batches of ten data points, therefore the adaptive filters are updated whenever ten new 
data points are available on the market. The recursive parameter estimates, which utilise this new information, 
are updated as well, the algorithm is self-tuning. Care has to be taken when choosing the initial values for the 
algorithm, since the EM-algorithm in its general form converges to a local maximum. In this implementation 
we choose the initial values with regard to mean and variance of the first ten data points. Figure 1 shows the 
original return series, optimal parameter estimates for the index returns as well as the one-step ahead forecast. 

To highlight the sensitivity of the filter technique towards exogenous outliers we plant unusual high re- 
turns within the time series. Considerable SO outliers are included at time steps t = 40, 80, 130, 140. The 
optimal parameter estimation through the filter-based EM-algorithm of this data set with outliers can be seen 
in Figure 2. The filter still finds optimal parameter estimates, although the estimates are visibly affected by 
the outliers. 
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Figure 2: Optimal parameter estimates for monthly MSCI returns with planted outliers 



In a third step, severe outliers are planted into the observation sequence. Data points t — 40, 80, 130, 140 now 
show severe SO outliers as can be seen from the first panel in Figure 3. The filters cannot run through any 
longer, optimal parameter estimates cannot be established in a setting with severe outliers. 

In practice, asset or index return time series can certainly include outliers from time to time. This might be 
due to wrong prices in the system, but also due to very unlikely market turbulence for a short period of time. 
It has to be noted, that the type of outliers which we consider in this study does not characterise an additional 
state of the Markov chain. In the following, we develop robust filter equations, which can handle exogenous 
outliers. 

5 Robust Statistics 

To overcome effects like in Figure 3 we need more stable variants of the Elliott type filters discussed so far. This 
is what robust statistics is concerned with. Excellent monographs on this topic are e.g., Huber [1981], Hampcl 
et al. [1986], Rieder [1994], Maronna et al. [2006]. This section provides necessary concepts and results from 
robust statistics needed to obtain the optimally-robust estimators used in this article. 



5.1 Concepts of Robust Statistics 

The central mathematical concepts of continuity, differentiability, or closeness to singularities may in fact serve 
to operationalize stability quite well already. To make these available in our context, it helps to consider a 
statistical procedure, i.e.; an estimator, a predictor, a filter, or a test as a function of the underlying distribution. 
In a parametric context, this amounts to considering functionals T mapping distributions to the parameter set 
O . An estimator will then simply be T applied to the empirical distribution F n . For filtering or prediction, 
the range of such a functional will rather be the state space but otherwise the arguments run in parallel. 

For a notion of continuity, we have to specify a topology, and as in case of outliers, we use topologies essen- 
tially compatible with the weak topology. With these neighborhoods, we now may easily translate the notions 
of continuity, differentiability and closest singularity to this context: (Equi-)continuity is then called qualitative 
robustness [Hampel et al., 1986, Sec. 2.2 Def. 3], a differentiable functional with a bounded derivative is called 
local robust, and its derivative is called influence function (IF) 1 , compare [Hampel et al., 1986, Sec. 2.1 Def. 1]. 



1 In mathematical rigor, the IF, when it exists, is the Gateaux derivative of functional T into the direction of the tangent 
S x — F . For certain properties, this notion is in fact too weak, and one has to require stronger notions like Hadamard or Frechet 
differentiability; for details, see Fernholz [1983] or [Rieder, 1994, Ch. 1], 
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Figure 3: Filter-based EM-algorithm for observation sequence with severe outliers 



The IF reflects the infinitesimal influence of a single observation on the estimator. Under additional assump- 
tions, many of the asymptotic properties of an estimator are expressions in the IF ip . E.g., the asymptotic 
variance of the estimator in the ideal model is the second moment of ip . Infinitesimally, i.e., for e — > 0, the 
maximal bias on U is just sup \i\)\ , where | • | denotes Euclidean norm, sup \ip\ is then also called gross error 
sensitivity (GES), [Hampel et al., 1986, (2.1.13)]. Seeking robust optimality hence amounts to finding optimal 
IFs. 

To grasp the maximal bias of a functional T on a neighborhood U = U(F;e) of radius e, one considers 
the max-bias curve e i— >• suPgeuIf-s) l-^(^) — T(F)\ ■ The singularity of this curve closest to (i.e., the ideal 
situation of no outliers) captures its behavior under massive deviations, or its global robustness. In robust 
statistics, this is called breakdown point — the maximal radius e the estimator can cope with without producing 
an arbitrary large bias, see [Hampel et al., 1986, Sec. 2.2 Def.'s 1,2] for formal definitions. 

Usually, the classically optimal estimators (MLE in many circumstances) are non-robust, both locally and 
globally. Robust estimators on the other hand pay a certain price for this stability as expressed by an asymptotic 
relative efficiency (ARE) strictly lower than 1 in the ideal model, where ARE is the ratio of the two asymptotic 
(co)variances of the classically optimal estimator and its robust alternative. 

To rank various robust procedures among themselves, other quality criteria are needed, though, summarizing 
the behavior of the procedure on a whole neighborhood, as in (4.1). A natural candidate for such a criterion 
is maximal MSE (maxMSE) on some neighborhood U around the ideal model and, for estimation context, 
maximal bias (maxBias) on the respective neighborhood, or, referring to famous [Hampel, 1968, Lemma 5], 
trace of the ideal variance subject to a bias bound on this neighborhood. In estimation context, the respective 
solution are usually called OMSE (Optimal MSE estimator), MBRE (Most iJias -Robust .Estimator), and 
OBRE (Optimally Bias -Robust Estimator) 2 . 

In our context we encounter two different situations where we want to apply robust ideas: (recursive) 
filtering in the (E)-step and estimation in the (M)-step. While in the former situation we only add a single new 
observation, which precludes asymptotic arguments, in the (M)-step, the preceding application of Girsanov's 
theorem turns our situation into an i.i.d. setup, where each observation becomes (uniformly) asymptotically 
negligible and asymptotics apply in a standard form. 

5.2 Our Robustification of the HMM: General Strategy 

As a robustification of the whole estimation process in this only partially observed model would (a) lead to 
computationally intractable terms and (b) would drop the key feature of recursivity, we instead propose to 
robustify each of the steps in Elliott's algorithm separately. Doing so, the whole procedure will be robust, 



2 For the terms OBRE and MBRE, see Hampel et al. [1986], while for OMSE see Ruckdeschel and Horbenko [2010]. 
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but in general will loose robust optimality, i.e.; contrary to multiparameter maximum likelihood, the Bcllmann 
principle does not hold for optimal robust multi-step procedures simply because optimal clipping in several 
steps is not the same as joint optimal clipping of all steps. Table 1 lists all essential steps in Elliot's algorithm 
related to our proposed robustification approach. 



Classical setting Robust version 

Initialization: Find suitable starting values for II, /, a and Xq. 



Build TV clusters on first batches 



Use first and second moment of each 
cluster as initial values for / and a. 
Choose II and Xo according to 
cluster probabilities. 



Build N + 1 clusters on first batches, 

distribute points in outlier cluster randomly 

on other clusters. 

Use median and MAD of clusters for 

/ and a. 

Choose II and Xo according to 

cluster probabilities. 



E-step: Determine RN-derivative and calculate E-Step. 



Find RN-dcrivative. A 



Estimate recursive filters 
Jj\ Ol and Tl{g). 



Robustificd version of A 

through suitable clipped version of A&. 

No further robustification needed; i.e., 

take over Jf, 1 , 0\ unchanged and skip Tfc(g). 



M-step 1: Obtain estimates for / and a. 



MLE-estimates for / and a through 

recursive filters O l k and TJ,(g). 

Recursive filters are substituted into likelihood. 

Estimates updated after each batch. 



Likelihoods re-stated; they are expressed 

as weighted sums of the observations y^. 

Robustified version of MLE through asymptotic 

linear estimators. 

Estimates updated after each batch, 

recursivity cannot be preserved completely 



M-step 2: Obtain ML-estimators II. 



MLE-estimation, recursive filters J k and 
0\ are substituted into likelihood. 



Robustification through robust version of filters. 
J k and 0|, no further observation y^ 
has to be considered. 



Rec: Algorithm runs on next batch. 



Go to (RN) to compute the next batch. 



Go to (RN) to compute the next batch. 



Table 1: Classical algorithm setting and robustified version of each step. 



5.3 Robustification of Step (0) 

So far, little has been said as to the initialization even in the non-robustified setting. Basically, all we have to do 
is to make sure that the EM algorithm converges. In prior applications of this algorithms (Mamon et al. [2008] 
and Erlwein et al. [2011] amongst others), one approach was to fill II with entries 1/N , i.e., with uniform (and 
hence non-informative) distribution over all states, independent from state x . As to /, and a^ , an ad hoc 
approach would estimate the global mean and variance over all states and then, again in a non-informative way 
jitter the state-individual moments, adding independent noise to it. In our preliminary experiments, it turned 
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out that this naive approach could drastically fail in the presence of outliers, so we instead propose a more 
sophisticated approach which can also be applied in a classical (i.e., non- robust) setting: In a first step we ignore 
the time dynamics and interpret our observations as realizations of a Gaussian Mixture Model, for which we 
use R package mclust (Fralcy and Raftery [2002], Fraley et al. [2012]) to identify the mixture components, and 
for each of these, we individually determine the moments /$ and <7j . As to II , we again assume independence 
of Xo , but fill the columns according to the estimated frequencies of the mixture components. In case of the 
non-robust setting we would use N mixture components and for each of them determine fi and <j{ by their ML 
estimators (assuming independent observations). For a robust approach, we use N+l mixture components, one 
of them — the one with the lowest frequency — being a pure noise component capturing outliers. For each non- 
noise component we retain the ML estimates for fi and c^ . The noise component is then randomly distributed 
amongst the remaining components, respecting their relative frequencies prior to this redistribution. 

We are aware of the fact that reassigning the putative outliers at random could be misleading in ideal 
situations (with no outliers) where one cluster could be split off into two but not necessarily so. Then in our 
strategy, the smaller offspring of this cluster would in part be reassigned to wrong other clusters, so this could 
still be worked on. On the other hand, this choice often works reasonably well, and as more sophisticated 
strategies are questions for model selection, we defer them to further work. 

Based on the fi and tJj , for each observation j and each state % , we get weights < Wij < 1 , ^2 { Wij = 1 
for each j , representing the likelihood that observation j is in state i . For each i , again we determine 
robustified moment estimators f[ , a[ as weighted medians and scaled weighted MADs (medians of absolute 
deviations). 

Weighted Medians And MADs: For weights Wj > and observations yj , the weighted median 
m = m(y, w) is defined as m = argmin^ J^ . Wj\yj — f\ , and with y'- = \yj — m\ , the scaled weighted MAD 
s = s(y,w) is defined as s — c _1 argmin t J^. Wj\y'j — t\ , where c is a consistency factor to warrant consistent 

estimation of a in case of Gaussian observations, i.e., c = argmin t EJ^ • Wj \\y~j\ — t\ for yj ~' A/"(0, 1) . c 
can be obtained empirically for a sufficiently large sample size M , e.g., M = 10000, setting c = jj X)fe=i c k i 
c fe = argmin t *£j w j |b",fcl ~ t \ » v'ik l '~' -^{0, 1) . 

As to the (finite sample) breakdown point FSBP of the weighted median (and at the same time for the 
scaled weighted MAD), we define tin = Wij/^2-, Wji , and for each i define the ordered weights w?.n such 

that w° > w° {2) > . . . > w° {k) . Then the FSBP in both cases is fc" 1 min{j = 1, . . . , k \ YjjLi w ? jo ) > k / 2 } 
which (for equivariant estimators) can be shown to be the largest possible value. So using weighted medians 
and MADs, we achieve a decent degree of robustness against outliers. E.g., assume we have 10 observations 
with weights 5 x 0.05; 3 x 0.1; 0.2; 0.25. Then we need at least three outliers (placed at weights 0.1,0.2,0.25, 
respectively) to produce a breakdown. 

5.4 Robustification of the E-step 

As indicated, in this step we cannot recur to asymptotics, but rather have to appeal to a theory particularly 
suited for this recursive setting. In particular, the SO-neighborhoods introduced in (4.3) turn out to be helpful 
here. 

5.4.1 Crucial Optimality Thm 

Consider the following optimization problem of reconstructing the ideal observation Y ld by means of the 
realistic/possibly contaminated Y' c on an SO-neighborhood. 

Minimax-SO problem Minimize the maximal MSE on an SO-neighborhood, i.e., find a Y T ° -measurable 
reconstruction /o for Y ld s.t. 

max^ E ro |F id - /(r ro )| 2 = min/ ! (5.1) 

The solution is given by 
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Theorem 5.1 (Minimax-SO) In this situation, there is a saddle-point (fo,P(f ) for Problem (5.1) 

f (y) := EY" + H p (D(y)), H b (z) = zmin{l, b/\z\} (5.2) 

Pt\dy) := ^{\D{y)\/p 1) + P y ' d (dy) (5.3) 

where p > ensures that J Pq (dy) = 1 and 

D(y) = j/-Ey ,d (5.4) 

T/ie uahfe of the minimax risk of Problem (5.1) is 

trCov(Y" ld )-(l-r)E ld [min{|L>(Y id )|, p} 2 ] (5.5) 

PROOF : See Appendix 7. //// 

The optimal procedure in equation (5.2) has an appealing interpretation: It is a compromise between the 
(unobservable) situations that (a) one observes the ideal T id , in which case one would use it unchanged and 
(b) one observes T dl , i.e.; something completely unrelated to Y ld , hence one would use the best prediction for 
Y ld (in MSE-sense) without any information, hence the unconditional expectation E Y ,d . The decision on how 
much to tend to case (a) and how much to case (b) is taken according to the (length of the) discrepancy D(Y TO ) 
between observed signal Y'° and E Y ld . If this length is smaller than p , we keep Y' c unchanged, otherwise 
we modify E Y" id by adding a clipped version of D(Y") . 

5.4.2 Robustification of Steps (RN), (E) 

In the Girsanov-/change of measure step we recall that the corresponding likelihood ratio here is just 

<7 _1 (x s _i)v5(o- _1 (x a _ 1 )(j/ a - /(x s _i)) ) 
A s := ^ j— > (5.6) 

Apparently A s can both take values close to , and, more dangerously, in particular for small values of <r(x s _i) , 
very large values. So bounding the A s is crucial to avoid effects like in Figure 3. 

A first (non-robust) remedy uses a data driven reference measure, i.e., instead of A/"(0, 1) , we use A/"(0, a 2 ) 
where a is a global scale measure taken over all observations, ignoring time dependence and state-varying 
a 's. A robust proposal would take a to be the MAD of all observations (tuned for consistency at the normal 
distribution). This leads to 



a 1 (yi s _ 1 )ip[a 1 (x s _i)(y s - /(x s _i))j 



A s := V _ 1 ,__! , '- (5.7) 

a 1 ip(a l y s ) 

Eventually, in both estimation and filtering/prediction, a cancels out as a common factor in nominator and 
denominator, so is irrelevant in the subsequent steps; its mere purpose is to stabilize the terms in numeric 
aspects. 

To take into account time dynamics in our robustification, we want to use Theorem 5.1, but to this end, we 
need second moments, which for A s need not exist. So instead, we apply the theorem to T ld = yA s , which 
means that A s = (Y id ) 2 is robustified by 



A s = (E ld ^A s + F 6 (V A s - E ld \JX S )Y (5.8) 

for Hb(x) — xmui{l,b/\x\} . Clipping height b in turn is chosen such that EA S = q, a — 0.95 for instance. 
As in the ideal situation E X s = 1 , in a last step with a consistency factor c' s determined similarly to Cj in the 
initialization step for the weighted MADs, we pass over to A° = c s A s such that E Aj = 1 . 

Similarly, in the remaining parts of the E-step, for each of the filtered processes generically denoted by G 
and the filtered one by G , we could replace G by 

G = E ld G + # b (G-E id G) (5.9) 
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for G any of Ji l , 0\ , and Tfc(f) and again suitably chosen b. 

It turns out though, that it is preferable to pursue another route. The aggregates T^(/) are used in the 
M-step in (3.10), but for a robustification of this step, it is crucial to be able to attribute individual influence 
to each of the observations, so instead we split up the terms of the filtered neg-loglikclihood into summands 
w i-i/OV\-(yj ~ fi) 2 /°f + log "*] for j = 1, . . . , fc, and hence we may skip a robustification of T£(/) . Similarly, 
as J\ % , 0\ are filtered observations of multinomial-like variables, a robustification is of limited use, as any 
contribution of a single observation to these variables can at most be of absolute value 1 , so is bounded 
anyway. Hence in the E-step, we only robustify A s . The splitting up the aggregates Tfc(f) into summands 
amounts to giving up strict recursivity, as for k observations in one batch, one now has to store the values 
Wi_jjO\ for j = l,...,k, and building up from j = 1 , at observation time j = jo within the batch, we 
construct Wij-j a , j = 1, ... ,jo from the values Wij-j -i , j = 1, . . . , jo — 1, so we have a growing triangle of 
weight values. This would lead to increasing memory requirements, if we had not chosen to work in batches of 
fixed length k , which puts an upper bound onto memory needs. 

5.5 Robustification of the (M)-Step 

As mentioned before, contrast to the (E)-step, in this estimation step, we may work with classical gross error 
neighborhoods (4.1) and with the standard i.i.d. setting. 

5.5.1 Shrinking Neighborhood Approach 

By Bienaymc, variance then usually is 0(l/n) for sample size n, while for robust estimators, the maximal bias 
is proportional to the neighborhood radius e . Hence unless e is appropriately scaled in n , for growing n , bias 
will dominate eventually for growing n . This is avoided in the shrinking neighborhood approach by setting 
e = e n = r / y/n for some r > 0, compare Rieder [1994]. Kohl ct al. [2010] sets e = e n = r/y/n for some initial 
radius r € [0, oo) . One could see this shrinking as indicating that with growing n , diligence is increasing so the 
rate of outliers is decreasing. This is perhaps overly optimistic. Another interpretation is that the severeness of 
the robustness problem with 10% outliers at sample size 100 should not be compared with the one with 10% 
outliers at sample size 10000 but rather with the one with 1% outliers at this sample size. 

In this shrinking neighborhood setting, with mathematical rigor, optimization of the robust criteria can be 
deferred to the respective IFs, i.e. instead of determining the IF of a given procedure, we construct a procedure 
to a given (optimally-robust) IF. 

This is achieved by the concept of asymptotically linear estimators (ALEs), as it arises canonically in most 
proofs of asymptotic normality: In a smooth ( L^ -differentiable) parametric model V — {Pg, 6 <G 0} for i.i.d 
observations Xi ~ Pg with open parameter domain C R d based on the scores 3 Kg and its finite Fisher 
information Xg = Eg AgAg~ , we define the set ^2(6) of influence functions as the subset of L$(Pg) consisting 
of square integrable functions ipg with d coordinates with Eg ipg = and Eg ipgAg = 1^ where 1^ is the 
d -dimensional unit matrix. Then a sequence of estimators S n — S n {X\ 1 . . . , X n ) is called an ALE if 

n 

s n = e + -J2MXi) + o P ?(n- 1/2 ) (5.10) 

for some influence function tpg G ^(^O ■ In the sequel we fix the true 9 € and suppress it from notation 
where clear from context. 

In particular, the MLE usually has influence function ^ MLE = I _1 A, while most other common estimators 
also have a representation (5.10) with a different ip . 

For given IF tp we may construct an ALE 9 n with tp as IF by a one-step construction, often called 
one-step-reweighting: To given starting estimator 6% such that i?° = 8^ — 9 = op«(n _1 / 4+0 ) we define 

1 ™ 

0n = 0S + -X>«o(Xj) (5.11) 

71 J'=l 

Then indeed 9 n = 9+^ X^?=i i ) 9(Xj)+R n and R n = op»(n -1 / 2 ) , i.e., 9 n forgets about 9^ as to its asymptotic 
variance and GES; however its breakdown point is inherited from 6® once is unbounded and ip is bounded. 
Hence for the starting estimator, we seek for 0° with high breakdown point. 



^Usually Ag is the logarithmic derivative of the density w.r.t. the parameter, i.e., Ag(rr) = d/d9logpg(x) . 
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Figure 4: Influence function of the MBRE at Af(0, 1) ; left panel: location part; right panel: scale part. 



For a more detailed account on this approach, see Rieder [1994]. 

5.5.2 Shrinking Neighborhood Approach With Weighted Observations 

As mentioned, coming from the E-step, not all observations yj are equally likely to contribute to state i , 
hence we are in a situation with weighted observations, where we may pass over to normed weights u>° = 



»*,} 



f%2ji Wi.j' summing up to 1 . 



Suppressing state index % from notation here, with these weights and 6 n the vector of weighted median and 
scaled weighted MAD for state i, (5.11) becomes 






(5.12) 



for 9 n again a two-dimensional ALE with location and scale coordinate and ipe(y) = wpiiy — / )/c) ' a t 
9 = (/, a) T , is the IF of the MBRE in the one-dimensional Gaussian location and scale model at A/"(0, 1) ; i.e., 

V(y) = bY(y)/\Y(y)\, Y(y) = (y, A(y 2 - 1) - a), (5.13) 

with numerical values for A, a, b up to four digits taken from R package RobLox, Kohl [2012], being 

A = 0.7917, a = -0.4970, b = 1.8546, (5.14) 

Influence function ip is illustrated in Figure 4. 

To warrant positivity of a and to maintain a high breakdown point even in the presence of inliers (driving 
a to essentially 0), for the scale component, instead of (5.12) we use the asymptotically equivalent form 



n 
a n = a° CXp |^ Y^ W °j V'scale ( (Vj - M° )/v„ ) 



(5.15) 



3 = 1 
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Using the MBRE-?/; from (5.13) and (5.14) on first glance could be seen as overly cautious. Detailed simulation 
studies, compare e.g. Kohl and Deigner [2010], show that for our typical batch lengths of 10-20, the MB RE 
also is near to optimal in the sense of Ricdcr ct al. [2008] in the situation where nothing is known about the 
true outlier rate (including, of course the situation where no outliers at all occur) . 

5.5.3 Robustification of Steps (Ml) & (M2) 

Now, we derive robust estimators of the model parameters fi and <Ji , i.e., we justify passage to weighted ALEs 
as in (5.12). In particular we specify the weights w" = w- 1 - therein. 

Recall, that the Ml-step in the classical algorithm gives the optimal parameter estimates stated in Theorem 
3.1. We now build ALEs, which can be achieved, when the MLEs of the parameters fi and o~i are stated as 
weighted sums of the observations tjk- 



Theorem 5.2 With 



*ii = !^)\ e ( 5 - 16 ) 



the optimal parameter estimates fi and &i are given by 

k 



fi = J2 w liW (5.17) 

i=i 

k 

v\ = Y. w lM-M 2 - ( 5 - 18 ) 



Proof : To find the optimal estimate for / consider 

k 



K ■= \\K with A,* :=cxp( 



(^-(/,x ; _ 1 )) 2 -(y i -(/,x i _ 1 )) : 



2(a,x,_ 1 )2 
Up to constants irrelevant for optimization, the filtered log-likelihood is then 

N k 

E[ln(A* fe ) | F y k ] =Y.H<M fif ( 5 - 19 ) 

i=l 1=1 

Maximising the log-likelihood E[ln(A* k ) \ F k ] in fi hence leads to the optimal parameter estimate 

k k 

J2(xi-ue l )(2m%-f?) = =► £(*) = £» 

l=i i 

In an analogue way, for o~i we define 

a+ tt \+ s+v, \+ (^ x '-i) „ (iyi- </» x i-i)) 2 (vi- (f^i-i)) 2 

A k := X l Wltn ^l : = 7^ \~ CX P TTi \9 ?w^ \o — 

f- = \ l (<r,xj_i) V 2(ct,x,_ 1 ) 2 2(a,x z _ 1 ) 2 

and hence, again up to irrelevant terms 

N k , _ JN 2 

E[ln(A+) | Ft] =EE^-i' e »>[ 2< 7 2 +1 °g^)] ( 5 ' 2 °) 

i=l 1=1 i 

From this term, which has to be minimised for Oi we get 

k k 

J2(- Xl _ 1 ,e i )[~( yi -f i f + ^\ =o =* 3f = £>?,(w-£) 2 

i=i l i=i 



1G 

Note that (5.20) takes its minimum at the same place as 

N k 



k(AjT) = EE<«K»-/<) 9 - ?] a ( 5 - 21 ) 



1=1 1=1 

nn 

For the robustification of the parameter estimation (step Ml) we now distinguish two approaches. The first 
robustification is utilized in the first run over the first batch of data and is therefore called the initialization 
step Ml. The robust estimates of the parameters from the second batch onwards are then achieved through a 
weighted ALE. 

Theorem 5.3 The robust parameter estimates for the model parameters /j and o~i in the (1) initialization 
and (2) all following batches are given by 

1. Replacing, for initialization, the squares by absolute values in (5.19) and in (5.21), fi and <7i are the 
weighted median and scaled weighted MAD, respectively, of the yi , I — 1, ... ,k with weights (u)^); . 

2. For further batches, the weighted MBRE is obtained as a one-step construction with the parameter estimate 
(fii a i) f T om the previous batch as starting estimator and with IF ip — (V'loc, V'scaic) from (5.13), (5.14), 
i.e., 



fi = /f + ^°E<^-((2/z-/°)Ax°) (5.22) 

l=i 

k 

&i = «T?exp(E<^-cata((w- < / ?)/*?)) (5.23) 



i=i 
Proof : 

1. Initialization: With absolute values instead of squares, (5.19) becomes 

k 

fi = argmin Vw°,|2/i - /;| 
f* 1=1 

Now if w^ ; is constant in I , this leads to the empirical median as unique minimizer justifying the name. 
For the scaled weighted MAD, the argument parallels the previous one, leading to consistency factor 
c, = $(3/4) for $ the cdf of Af(0, 1) . 

2. Ml in further batches: Apparently, by definition, (/»,0't) is an ALE, once we show that ip is square 
integrablc, E(^) = 0, E(^A') = I2 . The latter two properties can be checked numerically, while by 
boundedness square integrability is obvious. In addition it has the necessary form of an MBRE in the 
i.i.d. setting as given in [Rieder, 1994, Thm 5.5.1]. To show that this also gives the MBRE in the context 
of weighted observations, we would need to develop the theory of ALEs for triangular schemes similar 
to the one in the Lindeberg Feller Theorem. This has been done, to some extent in [Ruckdcschcl, 2001, 
Section 9]. In particular, for each state i, we have to assume a Noether condition excluding observations 
overly influential for parameter estimation in this particular state, i.e., 

k 

lim max >°, ;fe ) 2 /E««) 2 = (5.24) 

.7=1 

We do not work this out in detail here, though. 

//// 
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Figure 5: Robust parameter estimates for monthly MSCI returns between 1994 and 2009 — analogue to Figure 1 



Consider again our filtering algorithm and recall, that the filter runs over the data set in batches of roughly 
ten two fifty data point. To determine the ALE for our parameters, we have to calculate the weights w®i = 

(x(_i, ei)/0\. Therefore, our algorithm has to know all values of x; from 1 to k in each batch. With this, 
our robustification of the algorithm cannot obtain the same recursiveness as the classical algorithm. However, 
since we only have to determine and save the estimates of x^ in each batch, the algorithm still is numerically 
efficient, the additional costs are low. In general, the ALEs are fastly computed robust estimators, which lead 
in our case to a fast and, over batches, recursive algorithm. 

The additional computational burden to store all the weights w®i arising in the robustification of the 
Ml-step is more than paid off by the additional benefits they offer for diagnostic purposes beyond the mere 
EM-algorithm: They tell us which of the observations, due to their likelihood to be in state i carry more 
information on the respective parameters /$ and ai than others. The same goes for the terms ipe{yi) which 
capture the individual information of observation yi for the respective parameters. Even more though, the 
coordinates of ipg(yj)/\ipg(yi)\ tell us how much of the information in observation yi is used for estimating /j 
and how much for Oi . In addition the function y i—> w® i'4>g{y) can be used for sensitivity analysis, telling us 
what happens to the parameter estimates for small changes in observation y . Finally, using the undipped, 
classically optimal IF of the MLE, but evaluated at the robustly estimated parameters, we may identify outliers 
not fitting to the "usual" states. 



6 Implementation and Simulation 



The classical algorithm as well as the robust version are implemented in R; we plan to release the code in 
form of a contributed package on CRAN at a later stage. The implementation builds up on, respectively uses 
contributed packages RobLox and mclust. At the time of writing we are preparing a thorough simulation study 
to explore our procedure in detail and in a quantitative way. For the moment, we restrict ourselves to assess 
the procedure in a qualitative way, illustrating how it can cope with a situation like in Figure 3. 

In Figure 5, we see the paths of the robust parameter estimates for / , a , and II ; due to the new initialization 
procedure, the estimates — in particular those for II — differ a little from those of Figure 1. Still, all the estimators 
behave very reasonable and are not too far from the classical ones. 

In the outlier situations from Figures 2 and 3, illustrated in Figures 6, the estimates for / and a remain 
stable at large as desired. The estimates for IT however do get irritated, essentially flagging out one state as 
outlier state. Some more work remains to be done to better understand this and to see how to avoid this. 

Aside from this, our algorithm already achieves its goals; in particular, our procedure never breaks down — 
contrary to the classical one. 
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Figure 6: Robust parameter estimates for monthly MSCI with planted outliers — analogue to Figures 2 and 3 



7 Conclusion 

In financial applications, we often have to consider the case of outliers in our data set, which can occur from 
time to time e.g., due to cither wrong values in the financial database or unusual peaks or lows in volatile 
markets. Conventional parameter estimation methods cannot handle these specific data characteristics well. 



Contribution of this paper: Our contribution to this issue is two fold: 

First, we analyse step by step the general filter-based EM-algorithm for HMMs by Elliott [1994] and high- 
light, which problems can occur in case of extreme values. We extend the classical algorithm by a new technique 
to find initial values, taking into account the N— state setting of the HMM. In addition, for numerical reasons 
we use a data-driven reference measure instead of the standard normal distribution. 

Second, we have proposed a full robustification of the classical EM-algorithm. Our robustified algorithm is 
stable w.r.t. outliers in the observation process and is still able to estimate processes of the Markov chain as 
well as optimal parameter estimates with acceptable accuracy. The robustification builds up on concepts from 
robust statistics like SO-optimal filtering and asymptotic linear estimators. Due to the non-iid nature of the 
observations as apparent from the non-uniform weights wf j attributed to the observations, these concepts had 
to be generalized for this situation, leading to weighted medians, weighted MADs, weighted ALEs. Similarly, 
the SO-optimal filtering (with focus on state reconstruction) is not directly applicable for robustifying the 
Radon-Nikodym terms X s , where we (a) had to clean the "observations" themselves and (b) had to pass over 
to yAs f° r integrability reasons. 

Our robust algorithm is computationally efficient. Although complete recursivity cannot be obtained, the 
algorithm runs over batches and keeps its recursivity there additionally storing the filtered values of the Markov 
chain. This additional burden is outweighed though by the benefits of these weights and influence function 
terms for diagnostic purposes. As in the original algorithm, the model parameters, which are guided by the state 
of the Markov chain, are updated after each batch, using a robust ALE however. The robustification therefore 
keeps the characteristic of the algorithm, that new information, which arises in the observation process, is 
included in the recent parameter update — there is no forward-backward loop. 
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The forecasts of asset prices, which are obtained through the robustificd parameter estimates, can be utilized 
to make investment decisions in asset allocation problems. 

To sum up, our forecasts are robust against additive outliers in the observation process and able to handle 
switching regimes occurring in financial markets. 



Outlook: It is pretty obvious how to generalize our robustification to a multivariate setting: The E-step is 
not affected by multivariate observations, and the initialization technique using Gaussian Mixture Models ideas 
already is available in multivariate settings. Respective robust multivariate scale and location estimators for 
weighted situations still have to be implemented, though, a candidate being a weighted variant of the (fast) 
MCD-estimator, compare Rousseeuw and Leroy [1987], Rousseeuw and van Driessen [1999]. 

Future work will hence translate our robustification to a multivariate setting to directly apply the algorithm 
to asset allocation problems for portfolio optimisation. Furthermore, investment strategies shall be examined 
within this robust HMM setting to enable investors a view on their portfolio, which includes possible outliers 
or extreme events. The implementation of the algorithms shall be part of an R package, including a thorough 
simulation study of the robustified algorithm and its application in portfolio optimisation. 

Finally, an automatic selection criterion for the number of states to retain would be desirable which is a 
question of model selection, where criteria like BIC have still to be adopted for robustness. 
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A Proofs 

Proof to Theorem 5.1 



(1) Let us solve raaxeu min/ [. . .] first, which amounts to mingu E re [| E ro [Y ld |F ro ]| ] . For fixed element P Y assume a 
dominating a -finite measure /i, i.e., p ^> P , fi>P ; this gives us a /i -density q(y) of P . Determining the 
joint (real) law P Y ' Y (dy, dy) as 

P(Y' d G A, Y" G B) =jl A (y) l B (y)[(l-r) l(y = y) + rq(y)] p y ' d (y) n(dy)»(dy) (A.l) 

we deduce that p,(dy) -a.c. 

E ry id \Y"=y] = r< i(y) EY ' d +( 1 - r )yp Y ' d (y) = . «ig(y)+°2(y) , A2 > 

rq(y) + (l -r)p Yid (y) ' a 3 q(y)+a 4 (y) 



Hence we have to minimize 



1 a 3 q(y) + a 4 (y) ^ y > 



in Mo — {q G L 1 (p)\ q > 0, fqd/j, — 1} . To this end, we note that F is convex on the non-void, convex cone 
M = {q G L\{n) | q > 0} so, for some p > , we may consider the Lagrangian 

L p(i) ■= F{q) + p qd^i 

for some positive Lagrange multiplier p. Pointwise minimization in y of Lp(q) gives 

q s (y)= 1 -^(\D(y)\/s-l) + p Y (y) 

for some constant s = s(p) = (|EY ld | 2 + p/r) 1 ' 2 , Pointwise in y, q s is antitone and continuous in s > and 
lmis^ojoo] q 3 (y) = oo [0] , hence by monotone convergence, 



H(s) = / q a (y)p{dy) 
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too, is antitone and continuous and lim s ^ [oo] H(s) = oo[0] . So by continuity, there is some p £ (0, oo) with H(p) — 1 . 
On Mo , / qdfj, — 1 , but q p = q s = P £ -Mo and is optimal on M D Mo hence it also minimizes F on Mo . In particular, 
we get representation (5.3) and note that, independently from the choice of /i , the least favorable P Y is dominated 
according to P Y <!C P Y , i.e.; non-dominated P Y are even easier to deal with. 

As next step we show that 

maxgy min/ [...] = min/ maxg M [. . .] (A. 3) 

To this end we first verify (5.2) determining fo(y) as fo(y) = E rc .p[X|y rc = y] . Writing a sub /superscript "re;P" for 
evaluation under the situation generated by P — P Y and P for P Y , we obtain the the risk for general P as 

MSE re;P [/ (F re ' P )] = (l-r)E ld |y ld -/ (r d )| 2 + rtrCovF ,d + 

+ rEpmin(|D(y d,; ' 9 )| 2 ,p 2 ) (A.4) 

This is maximal for any P that is concentrated on the set { \D(Y dl '" q )\ > p} , which is true for P . Hence (A. 3) follows, 
as for any contaminating P 

MSE re; p[/ (F reiP ] < MSE ro;P [/oOr o;P )] 

Finally, we pass over from dhi to U : Let f r , P r denote the components of the saddle-point for dU(r) , as well as 
p(r) the corresponding Lagrange multiplier and w r the corresponding weight, i.e., w r = w r (y) = min(l, p(r) / \D(y)\) . 
Let R(f,P,r) be the MSE of procedure / at the SO model dU(r) with contaminating P Y —P. As can be seen 
from (5.3), p(r) is antitone in r; in particular, as P r is concentrated on {|D(Y)| > p(r)} which for r < s is a subset 
of {\D(Y)\ > p(s)}, we obtain 

R(f 3 ,P s ,s) = R(f a ,P r ,s) forr<s 

Note that R(f s ,P, 0) = R(f a , Q, 0) for all P, Q— hence passage to R(f a , P, r) = R(f 3 ,P, r) - R(f 3 ,P, 0) is helpful— and 
that 

trCovF ld = E ld [trCov ld [Y ,d |Y" d ] + |D(r ,d )| 2 ] (A.5) 

Abbreviate w a (Y id ) = 1 - (l - w s (Y id )) 2 > to see that 

R(f s ,P, r) = r{ E ld [\D(Y id )\ 2 w s (Y id )] + E P min(\D(Y ld )\, p(s)) 2 } < 
< r{E ld [\D(Y' d )\ 2 w s (Y' d )] + p(sf } = R{f s ,Pr,r) < R(f s ,P s ,s) 

Hence the saddle-point extends to U(r) ; in particular the maximal risk is never attained in the interior U(r) \ dU(r) . 
(5.5) follows by plugging in the results. 

//// 
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